-
Notifications
You must be signed in to change notification settings - Fork 764
KB Document metadata #6024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KB Document metadata #6024
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
6a85254
to
4cb8191
Compare
kitsune/messages/migrations/0002_inboxmessage_to_group_outboxmessage_to_group.py
Outdated
Show resolved
Hide resolved
kitsune/messages/migrations/0003_remove_inboxmessage_to_group.py
Outdated
Show resolved
Hide resolved
2566ba9
to
9841862
Compare
Add django.contrib.humanize Update scss Update view to pass `helpful_votes` and `product` Update `document_meta` block in `document_macros.html` Include `document_meta` in the `content_block`
20d216d
to
639915c
Compare
- Modify `product/documents.html` to display metadata - Modify scss to support layout needed - Modify `facets.py` to return metadata needed
639915c
to
acbf68a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only other thing I'd consider -- in addition to my comments below -- is that when you release this code -- which has a new data structure for the votes_dict
cached under the votes_cache_key
-- the prior cached value will clash with your current code. It might be worth considering how to handle that so it doesn't cause errors during the period until the cached content expires.
) | ||
.values("revision_id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't remove .values("revision_id")
because it's required to group the later count annotations by revision, in other words, to generate vote counts per revision. It's a subtle thing with Django Querysets, but that's what values()
does when used before an aggregation (see https://docs.djangoproject.com/en/5.0/topics/db/aggregation/#values).
Without it, the count annotations simply count the votes per vote, which doesn't make sense.
For example, within my local dev environment, without the .values("revision_id")
clause, I get the following votes_query
rows (notice mulitple rows with the same revision_id
):
{'revision_id': 23, 'helpful_count': 0, 'total_count': 1, 'time_limited_count': 0}
{'revision_id': 23, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 5, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 1, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 0, 'total_count': 1, 'time_limited_count': 0}
but with the .values("revision_id")
clause included I get:
{'revision_id': 1, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 5, 'helpful_count': 1, 'total_count': 1, 'time_limited_count': 1}
{'revision_id': 18, 'helpful_count': 1, 'total_count': 2, 'time_limited_count': 1}
{'revision_id': 23, 'helpful_count': 1, 'total_count': 2, 'time_limited_count': 1}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great catch. I literally thought it was redundant, but now I understand its role. Thanks for this.
kitsune/wiki/facets.py
Outdated
.annotate(count=Count("*")) | ||
.values("revision_id", "count") | ||
.annotate( | ||
helpful_count=Count(Case(When(helpful=True, then=1), output_field=IntegerField())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're just counting a specific condition (when helpful=True
), Django provides a simpler way to do this:
helpful_count=Count("id", filter=Q(helpful=True)),
kitsune/wiki/facets.py
Outdated
time_limited_count=Count( | ||
Case( | ||
When( | ||
created__range=(Now() - timedelta(days=30), Now()), | ||
helpful=True, | ||
then=1, | ||
), | ||
output_field=IntegerField(), | ||
) | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly here, this can be simplified to:
time_limited_count=Count("id", filter=Q(helpful=True, created__range=(Now() - timedelta(days=30), Now())))
kitsune/wiki/facets.py
Outdated
row["revision_id"]: { | ||
"helpful_count": row["helpful_count"], | ||
"total_count": row["total_count"], | ||
"time_limited_count": row["time_limited_count"], | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're going to retrieve all 3 of these values later on, I would recommend simply storing the 3 values as a tuple. Also, a tuple is much more memory efficient -- it uses more than 50% less memory in this case -- and the votes_dict
will be quite large.
row["revision_id"]: (row["helpful_count"], row["total_count"], row["time_limited_count"])
and then later you can just retrieve them all like:
total_helpful_votes, total_votes, time_limited_helpful_votes = votes_dict.get(d.current_revision_id, (0, 0, 0))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another great catch. Thanks.
kitsune/wiki/facets.py
Outdated
total_votes = votes_dict.get(d.current_revision_id, {}).get("total_count", 0) | ||
total_helpful_votes = votes_dict.get(d.current_revision_id, {}).get("helpful_count", 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To match my comment above:
total_helpful_votes, total_votes, time_limited_helpful_votes = votes_dict.get(d.current_revision_id, (0, 0, 0))
kitsune/wiki/facets.py
Outdated
helpful_votes=votes_dict.get(d.current_revision_id, {}).get( | ||
"time_limited_count", 0 | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, to match the comments above:
helpful_votes=time_limited_helpful_votes,
votes_query = ( | ||
HelpfulVote.objects.filter( | ||
revision_id__in=qs.values_list("current_revision_id", flat=True), | ||
created__range=(datetime.now() - timedelta(days=30), Now()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing the created__range
filter here worries me, because it makes this votes_query
much more expensive. The HelpfulVote
table is one of our largest tables, if not the largest, with more than 56 million rows collected over 12 years, and growing.
When I ran the votes_query
in production without the created__range
filter, it took a few seconds (about 3-4 seconds) -- the query as it is today is practically instantaneous -- and that query time will grow as the table grows over time. That could be a problem when the cache is cold and the user has to wait several seconds for a fresh query to complete in order for the results to display on a products page.
We may need to compromise and limit the votes_query
to only consider votes within the past 4-5 years, to avoid delayed page results for the cold cache case. When I tested the query in production with a created__range
filter that limits the query to the past 5 years, it was still relatively fast (less than a second). Given that most current document revisions will probably be less than 4-5 years old and also given that restricting the query by created__range
prevents the cost of the query from growing as the table grows, it might be best to use a fixed created__range
.
If we do decide to restrict the query to the past 4-5 years, we should probably use the same limit when gathering the votes for the document view.
I think even if we decide against restricting the query to the past 4-5 years, we should at least restrict the query to something -- like maybe 10 years -- just to prevent the cost of the query growing over time as the HelpfulVote
table grows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, the current revision of more than a third of all of the active KB articles is older than 5 years.
In [37]: Document.objects.filter(current_revision__created__lte=date(2019, 1, 1), is_archived=False, is_template=Fa
...: lse).count()
Out[37]: 13696
In [38]: Document.objects.filter(current_revision__isnull=False, is_archived=False, is_template=False).count()
Out[38]: 37062
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could reduce the size of the HelpfulVote
table by about 25% by removing votes on revisions of documents that have been archived.
In [40]: HelpfulVote.objects.filter(revision__document__is_archived=True).count()
Out[40]: 14709060
In [41]: HelpfulVote.objects.count()
Out[41]: 56457295
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth more discussion. I'm not sure what the right answer is in the short term - I can't produce a % of votes without the total votes. I've excluded archived documents in the query.
If this turns out to be too intensive we can look at batch processing for totals and storing them somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps for now, we shouldn't worry about it? We're already caching the results for 24 hours, so only a single request every 24 hours may experience the "cold cache" delay.
kitsune/wiki/views.py
Outdated
@@ -271,12 +271,22 @@ def document(request, document_slug, document=None): | |||
breadcrumbs.append((product.get_absolute_url(), product.title)) | |||
# The list above was built backwards, so flip this. | |||
breadcrumbs.reverse() | |||
votes = HelpfulVote.objects.filter(revision=doc.current_revision).aggregate( | |||
total_votes=Count("id"), | |||
helpful_votes=Sum(Case(When(helpful=True, then=1), default=0, output_field=FloatField())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed this in my first review, but this can be simplified too:
helpful_votes=Count("id", filter=Q(helpful=True)),
<div id="document_metadata" class="has-border-top"> | ||
<span class="product">{% for product in document['products'] %}{{ product }}{% endfor %}</span> | ||
<span class="last-updated"> | ||
<img class="pencil" src="{{ webpack_static('sumo/img/pencil.svg') }}" /> <strong>Last updated:</strong> <span class="time">{{ document['created']|naturaltime }}</span> | ||
</span> | ||
{% if document['percent_helpful']>0 %} | ||
<span class="helpful-info"> | ||
<img class="thumbsup" src="{{ webpack_static('sumo/img/thumbs-up.svg') }}"/><span class="helpful-count">{{ document['percent_helpful'] }}%</span> of users found this helpful | ||
</span> | ||
{% endif %} | ||
</div> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use the document_metadata
macro here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I had talked myself out of that since it lives in a different app. I have now altered it and implemented it.
kitsune/wiki/facets.py
Outdated
created__range=(datetime.now() - timedelta(days=30), Now()), | ||
helpful=True, | ||
) | ||
HelpfulVote.objects.filter(revision__document__is_archived=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this .filter(revision__document__is_archived=False)
is necessary, given that we're already limiting the revisions to those within the qs
and the qs
is already filtering out the revisions of archived documents. It may even cause the votes_query
to be a bit more costly since now it has to join to both the Revision
and Document
tables, so I would recommend removing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that in addition to removing .filter(revision__document__is_archived=False)
from the votes_query
, you probably should retrieve/store the votes_dict
from/in a different key -- maybe just version the key -- so that the old data in the cache doesn't cause failures for the initial period of up to 24 hours (the cache timeout) after the release of this new code. Maybe something like:
...
votes_cache_key = f"votes_for_v2:{cache_key}"
if votes_dict is None:
...
@@ -94,6 +96,7 @@ | |||
{% endif %} | |||
<article class="wiki-doc"> | |||
{{ document_messages(document, redirected_from) }} | |||
{{ document_metadata(document.current_revision.created, helpful_votes, products, metadata_type) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. Maybe use document_metadata(..., metadata_type="document")
or document_metadata(..., "document")
instead of having to define a separate Jinja variable with {% set metadata_type = 'document' %}
?
@@ -115,6 +117,7 @@ <h2 class="sumo-card-heading"> | |||
</a> | |||
</h2> | |||
<p>{{ document['document_summary'] }}</p> | |||
{{ document_metadata(document['created'], document['percent_helpful'], document['products'], metadata_type) }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice that you were able to use the macro here!
Nit. Maybe use document_metadata(..., metadata_type="list")
or document_metadata(..., "list")
instead of having to define a separate Jinja variable with {% set metadata_type = "list" %}
?
c6859dd
to
ed56a7c
Compare
ed56a7c
to
7b58ce9
Compare
After a SuMO platform team discussion, we decided not to show the helpful votes metadata for the "list" of documents case in order to avoid the performance issue for now. We will show the helpful votes metadata for the single document case. We also decided to truncate the products metadata -- a comma-separated list of product titles associated with the document -- to 50 characters, although when the helpful votes metadata is included, I've reduced it to 25 characters in order to avoid overflow when the last updated metadata is long for cases like |
Enable meta data for KB articles and KB article listing pages:
django_jinja.contrib._humanize
app in order to give us friendly last updated times instead of reinventing the wheel.view.py
andfacets.py
files to support passing data needed